Data Science has become one of the most important disciplines in the world. It is used in many fields, such as healthcare, finance, and transportation. In this notebook, we will explore the Stack Overflow survey data to understand the data science discipline and the developers who work in this field.

  1. Understand Data Scientist Developers
  2. Salary Expectations for data scientists
  3. Most popular languages, databases, and platforms among data scientists

About the dataset: The dataset we will use is the Stack Overflow survey data from 2024. It contains information about developers, their job roles, and the technologies they use. We will focus on the data scientists in this dataset. Find the dataset here: https://survey.stackoverflow.co/

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
In [2]:
import plotly.io as pio

pio.templates["custom_white"] = pio.templates["plotly"]
pio.templates["custom_white"]["layout"]["paper_bgcolor"] = "white"

pio.templates.default = "custom_white"
In [3]:
df = pd.read_csv("data/survey_results_public.csv")
df
Out[3]:
ResponseId MainBranch Age Employment RemoteWork Check CodingActivities EdLevel LearnCode LearnCodeOnline ... JobSatPoints_6 JobSatPoints_7 JobSatPoints_8 JobSatPoints_9 JobSatPoints_10 JobSatPoints_11 SurveyLength SurveyEase ConvertedCompYearly JobSat
0 1 I am a developer by profession Under 18 years old Employed, full-time Remote Apples Hobby Primary/elementary school Books / Physical media NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 2 I am a developer by profession 35-44 years old Employed, full-time Remote Apples Hobby;Contribute to open-source projects;Other... Bachelor’s degree (B.A., B.S., B.Eng., etc.) Books / Physical media;Colleague;On the job tr... Technical documentation;Blogs;Books;Written Tu... ... 0.0 0.0 0.0 0.0 0.0 0.0 NaN NaN NaN NaN
2 3 I am a developer by profession 45-54 years old Employed, full-time Remote Apples Hobby;Contribute to open-source projects;Other... Master’s degree (M.A., M.S., M.Eng., MBA, etc.) Books / Physical media;Colleague;On the job tr... Technical documentation;Blogs;Books;Written Tu... ... NaN NaN NaN NaN NaN NaN Appropriate in length Easy NaN NaN
3 4 I am learning to code 18-24 years old Student, full-time NaN Apples NaN Some college/university study without earning ... Other online resources (e.g., videos, blogs, f... Stack Overflow;How-to videos;Interactive tutorial ... NaN NaN NaN NaN NaN NaN Too long Easy NaN NaN
4 5 I am a developer by profession 18-24 years old Student, full-time NaN Apples NaN Secondary school (e.g. American high school, G... Other online resources (e.g., videos, blogs, f... Technical documentation;Blogs;Written Tutorial... ... NaN NaN NaN NaN NaN NaN Too short Easy NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
65432 65433 I am a developer by profession 18-24 years old Employed, full-time Remote Apples Hobby;School or academic work Bachelor’s degree (B.A., B.S., B.Eng., etc.) On the job training;School (i.e., University, ... NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
65433 65434 I am a developer by profession 25-34 years old Employed, full-time Remote Apples Hobby;Contribute to open-source projects NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
65434 65435 I am a developer by profession 25-34 years old Employed, full-time In-person Apples Hobby Bachelor’s degree (B.A., B.S., B.Eng., etc.) Other online resources (e.g., videos, blogs, f... Technical documentation;Stack Overflow;Social ... ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
65435 65436 I am a developer by profession 18-24 years old Employed, full-time Hybrid (some remote, some in-person) Apples Hobby;Contribute to open-source projects;Profe... Secondary school (e.g. American high school, G... On the job training;Other online resources (e.... Technical documentation;Blogs;Written Tutorial... ... 0.0 0.0 0.0 0.0 0.0 0.0 NaN NaN NaN NaN
65436 65437 I code primarily as a hobby 18-24 years old Student, full-time NaN Apples NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

65437 rows × 114 columns

In [4]:
df.columns
Out[4]:
Index(['ResponseId', 'MainBranch', 'Age', 'Employment', 'RemoteWork', 'Check',
       'CodingActivities', 'EdLevel', 'LearnCode', 'LearnCodeOnline',
       ...
       'JobSatPoints_6', 'JobSatPoints_7', 'JobSatPoints_8', 'JobSatPoints_9',
       'JobSatPoints_10', 'JobSatPoints_11', 'SurveyLength', 'SurveyEase',
       'ConvertedCompYearly', 'JobSat'],
      dtype='object', length=114)
In [5]:
selected_columns = [
		'Age',
		'Employment',
		'RemoteWork',
		'EdLevel',
		'LearnCode',
		'LearnCodeOnline',
		'YearsCode',
		'YearsCodePro',
		'OrgSize',
		'Country',
		'CompTotal',
		'DevType',
		'LanguageHaveWorkedWith',
		'LanguageWantToWorkWith',
		'LanguageAdmired',
		'DatabaseHaveWorkedWith',
		'DatabaseWantToWorkWith',
		'DatabaseAdmired',
		'PlatformHaveWorkedWith',
		'PlatformWantToWorkWith',
		'PlatformAdmired',
		'WebframeHaveWorkedWith',
		'WebframeWantToWorkWith',
		'WebframeAdmired',
		'EmbeddedHaveWorkedWith',
		'EmbeddedWantToWorkWith',
		'EmbeddedAdmired',
		'MiscTechHaveWorkedWith',
		'MiscTechWantToWorkWith',
		'MiscTechAdmired',
		'ToolsTechHaveWorkedWith',
		'ToolsTechWantToWorkWith',
		'ToolsTechAdmired',
		'NEWCollabToolsHaveWorkedWith',
		'NEWCollabToolsWantToWorkWith',
		'NEWCollabToolsAdmired',
		'OpSysPersonal use',
		'OpSysProfessional use',
		'OfficeStackAsyncHaveWorkedWith',
		'OfficeStackAsyncWantToWorkWith',
		'OfficeStackAsyncAdmired',
		'OfficeStackSyncHaveWorkedWith',
		'OfficeStackSyncWantToWorkWith',
		'OfficeStackSyncAdmired',
		'AISearchDevHaveWorkedWith',
		'AISearchDevWantToWorkWith',
		'AISearchDevAdmired',
		'AISelect',
		'AISent',
		'AIBen',
		'AIAcc',
		'AIComplex',
		'AIToolCurrently Using',
		'AIToolInterested in Using',
		'AIToolNot interested in Using',
		'AINextMuch more integrated',
		'AINextNo change',
		'AINextMore integrated',
		'AINextLess integrated',
		'AINextMuch less integrated',
		'AIThreat',
		'AIEthics',
		'AIChallenges',
		'Industry',
		'WorkExp',
		'JobSat',
]
In [6]:
df = df[selected_columns]
df
Out[6]:
Age Employment RemoteWork EdLevel LearnCode LearnCodeOnline YearsCode YearsCodePro OrgSize Country ... AINextNo change AINextMore integrated AINextLess integrated AINextMuch less integrated AIThreat AIEthics AIChallenges Industry WorkExp JobSat
0 Under 18 years old Employed, full-time Remote Primary/elementary school Books / Physical media NaN NaN NaN NaN United States of America ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 35-44 years old Employed, full-time Remote Bachelor’s degree (B.A., B.S., B.Eng., etc.) Books / Physical media;Colleague;On the job tr... Technical documentation;Blogs;Books;Written Tu... 20 17 NaN United Kingdom of Great Britain and Northern I... ... NaN NaN NaN NaN NaN NaN NaN NaN 17.0 NaN
2 45-54 years old Employed, full-time Remote Master’s degree (M.A., M.S., M.Eng., MBA, etc.) Books / Physical media;Colleague;On the job tr... Technical documentation;Blogs;Books;Written Tu... 37 27 NaN United Kingdom of Great Britain and Northern I... ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 18-24 years old Student, full-time NaN Some college/university study without earning ... Other online resources (e.g., videos, blogs, f... Stack Overflow;How-to videos;Interactive tutorial 4 NaN NaN Canada ... NaN NaN NaN NaN No Circulating misinformation or disinformation;M... Don’t trust the output or answers NaN NaN NaN
4 18-24 years old Student, full-time NaN Secondary school (e.g. American high school, G... Other online resources (e.g., videos, blogs, f... Technical documentation;Blogs;Written Tutorial... 9 NaN NaN Norway ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
65432 18-24 years old Employed, full-time Remote Bachelor’s degree (B.A., B.S., B.Eng., etc.) On the job training;School (i.e., University, ... NaN 5 3 2 to 9 employees NaN ... NaN Learning about a codebase;Project planning;Doc... NaN NaN No Circulating misinformation or disinformation AI tools lack context of codebase, internal a... NaN NaN NaN
65433 25-34 years old Employed, full-time Remote NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
65434 25-34 years old Employed, full-time In-person Bachelor’s degree (B.A., B.S., B.Eng., etc.) Other online resources (e.g., videos, blogs, f... Technical documentation;Stack Overflow;Social ... 9 5 1,000 to 4,999 employees NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
65435 18-24 years old Employed, full-time Hybrid (some remote, some in-person) Secondary school (e.g. American high school, G... On the job training;Other online resources (e.... Technical documentation;Blogs;Written Tutorial... 5 2 20 to 99 employees Germany ... NaN NaN NaN NaN NaN NaN NaN NaN 5.0 NaN
65436 18-24 years old Student, full-time NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

65437 rows × 66 columns

In [7]:
import plotly.express as px

initial_dev_type_values = df["DevType"].value_counts()
sorted_values = initial_dev_type_values.sort_values(ascending=True)

fig = px.bar(sorted_values, 
             orientation='h', 
             title='Initial Developer Roles', 
             labels={'value': 'Number of Respondents', 'index': 'Developer Role'},
             color_discrete_sequence=['skyblue'])

fig.update_layout(xaxis_title='Number of Respondents', yaxis_title='Developer Role')

fig.show()

1. Understand Data Scientist Developers

To see what data scientists have in common, we filter the DevType column down to a few data-related roles: Data or business analyst; Data scientist or machine learning specialist; Data engineer; Developer, AI; and Scientist. We then draw a pie chart of how these roles are distributed.

Pie Chart

The pie chart shows that the most common role among data scientists is Data engineer, followed by Data scientist or machine learning specialist, ...

In [8]:
import plotly.graph_objects as go
import plotly.express as px

disciplines = [
    "Data or business analyst",
    "Data scientist or machine learning specialist",
    "Data engineer",
    "Developer, AI",
    "Scientist"
]

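# .copy() gives an independent DataFrame, so adding columns later does not raise SettingWithCopyWarning
df_ai = df[df["DevType"].str.contains("|".join(disciplines), na=False)].copy()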
devtype_counts = df_ai["DevType"].value_counts()

fig = go.Figure(data=[go.Pie(labels=devtype_counts.index, 
                             values=devtype_counts.values, 
                             textinfo='percent+label', 
                             marker=dict(colors=px.colors.qualitative.Set3),
                             hole=.3)])

fig.update_layout(title_text='Developer Roles')

fig.show()
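The role filter in the cell above passes the joined role names to `str.contains`, which treats the pattern as a regular expression and matches case-sensitively. The sketch below runs the same kind of filter on a made-up Series (the sample values are illustrative, not survey rows). Escaping each name with `re.escape` is an optional safeguard, assumed here in case a role name ever contains regex metacharacters.

```python
import re

import pandas as pd

# Hypothetical DevType values, for illustration only.
dev_types = pd.Series([
    "Data engineer",
    "Developer, AI",
    "Developer, full-stack",
    "Scientist",
    None,
])

disciplines = ["Data engineer", "Developer, AI", "Scientist"]

# Escape each role so it is matched literally, then join into one alternation.
pattern = "|".join(re.escape(role) for role in disciplines)

# na=False treats missing answers as "no match" instead of propagating NaN.
mask = dev_types.str.contains(pattern, na=False)
print(dev_types[mask].tolist())
```

Only the three listed roles are kept; the full-stack developer and the missing answer are dropped.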

Industry

We'll create a pie chart to visualize the distribution of industries where data scientists work. This will help us understand the industries that are most likely to employ data scientists.

In [9]:
import plotly.graph_objects as go

industry_counts = df_ai['Industry'].value_counts()

fig = go.Figure(data=[go.Pie(labels=industry_counts.index, 
                             values=industry_counts.values, 
                             textinfo='percent+label', 
                             hole=.3)])

fig.update_layout(title_text='Distribution of Industries')

fig.show()

Next we map each country to its continent; the mapping will be used later to visualize the distribution of data scientists by continent.
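Because country names that are missing from the mapping end up as 'Other', it can be worth checking which names a mapping misses. The sketch below does this on toy data; the sample answers and the deliberately incomplete dictionary are assumptions for illustration, not the notebook's real mapping.

```python
import pandas as pd

# Hypothetical country answers and a deliberately incomplete mapping.
countries = pd.Series(["Germany", "Viet Nam", "Nomadic", None, "Germany"])
country_to_continent = {"Germany": "Europe", "Nomadic": "Other"}

continents = countries.map(country_to_continent)

# Names that were answered but are missing from the mapping.
unmapped = countries[continents.isna() & countries.notna()].value_counts()
print(unmapped)
```

Any name listed here would silently fall into 'Other' unless it is added to the dictionary.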

In [10]:
country_to_continent = {
    # Africa
    'Algeria': 'Africa',
    'Angola': 'Africa',
    'Benin': 'Africa',
    'Botswana': 'Africa',
    'Burkina Faso': 'Africa',
    'Burundi': 'Africa',
    'Cabo Verde': 'Africa',
    'Cameroon': 'Africa',
    'Central African Republic': 'Africa',
    'Chad': 'Africa',
    'Comoros': 'Africa',
    'Congo': 'Africa',
    'Djibouti': 'Africa',
    'Egypt': 'Africa',
    'Equatorial Guinea': 'Africa',
    'Eritrea': 'Africa',
    'Eswatini': 'Africa',
    'Ethiopia': 'Africa',
    'Gabon': 'Africa',
    'Gambia': 'Africa',
    'Ghana': 'Africa',
    'Guinea': 'Africa',
    'Guinea-Bissau': 'Africa',
    'Ivory Coast': 'Africa',
    'Kenya': 'Africa',
    'Lesotho': 'Africa',
    'Liberia': 'Africa',
    'Libya': 'Africa',
    'Madagascar': 'Africa',
    'Malawi': 'Africa',
    'Mali': 'Africa',
    'Mauritania': 'Africa',
    'Mauritius': 'Africa',
    'Morocco': 'Africa',
    'Mozambique': 'Africa',
    'Namibia': 'Africa',
    'Niger': 'Africa',
    'Nigeria': 'Africa',
    'Rwanda': 'Africa',
    'Sao Tome and Principe': 'Africa',
    'Senegal': 'Africa',
    'Seychelles': 'Africa',
    'Sierra Leone': 'Africa',
    'Somalia': 'Africa',
    'South Africa': 'Africa',
    'South Sudan': 'Africa',
    'Sudan': 'Africa',
    'Tanzania': 'Africa',
    'Togo': 'Africa',
    'Tunisia': 'Africa',
    'Uganda': 'Africa',
    'Zambia': 'Africa',
    'Zimbabwe': 'Africa',

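    # Asia (Armenia, Azerbaijan, Cyprus, Georgia and Kazakhstan reappear under Europe below;
    # in a dict literal the later entry wins, so they end up mapped to 'Europe')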
    'Afghanistan': 'Asia',
    'Armenia': 'Asia',
    'Azerbaijan': 'Asia',
    'Bahrain': 'Asia',
    'Bangladesh': 'Asia',
    'Bhutan': 'Asia',
    'Brunei': 'Asia',
    'Cambodia': 'Asia',
    'China': 'Asia',
    'Cyprus': 'Asia',
    'Georgia': 'Asia',
    'India': 'Asia',
    'Indonesia': 'Asia',
    'Iran': 'Asia',
    'Iraq': 'Asia',
    'Israel': 'Asia',
    'Japan': 'Asia',
    'Jordan': 'Asia',
    'Kazakhstan': 'Asia',
    'Kuwait': 'Asia',
    'Kyrgyzstan': 'Asia',
    'Laos': 'Asia',
    'Lebanon': 'Asia',
    'Malaysia': 'Asia',
    'Maldives': 'Asia',
    'Mongolia': 'Asia',
    'Myanmar': 'Asia',
    'Nepal': 'Asia',
    'North Korea': 'Asia',
    'Oman': 'Asia',
    'Pakistan': 'Asia',
    'Palestine': 'Asia',
    'Philippines': 'Asia',
    'Qatar': 'Asia',
    'Saudi Arabia': 'Asia',
    'Singapore': 'Asia',
    'South Korea': 'Asia',
    'Sri Lanka': 'Asia',
    'Syria': 'Asia',
    'Taiwan': 'Asia',
    'Tajikistan': 'Asia',
    'Thailand': 'Asia',
    'Timor-Leste': 'Asia',
    'Turkey': 'Asia',
    'Turkmenistan': 'Asia',
    'United Arab Emirates': 'Asia',
    'Uzbekistan': 'Asia',
    'Vietnam': 'Asia',
    'Yemen': 'Asia',

    # Europe
    'Albania': 'Europe',
    'Andorra': 'Europe',
    'Armenia': 'Europe',
    'Austria': 'Europe',
    'Azerbaijan': 'Europe',
    'Belarus': 'Europe',
    'Belgium': 'Europe',
    'Bosnia and Herzegovina': 'Europe',
    'Bulgaria': 'Europe',
    'Croatia': 'Europe',
    'Cyprus': 'Europe',
    'Czech Republic': 'Europe',
    'Denmark': 'Europe',
    'Estonia': 'Europe',
    'Finland': 'Europe',
    'France': 'Europe',
    'Georgia': 'Europe',
    'Germany': 'Europe',
    'Greece': 'Europe',
    'Hungary': 'Europe',
    'Iceland': 'Europe',
    'Ireland': 'Europe',
    'Italy': 'Europe',
    'Kazakhstan': 'Europe',
    'Kosovo': 'Europe',
    'Latvia': 'Europe',
    'Liechtenstein': 'Europe',
    'Lithuania': 'Europe',
    'Luxembourg': 'Europe',
    'Malta': 'Europe',
    'Moldova': 'Europe',
    'Monaco': 'Europe',
    'Montenegro': 'Europe',
    'Netherlands': 'Europe',
    'North Macedonia': 'Europe',
    'Norway': 'Europe',
    'Poland': 'Europe',
    'Portugal': 'Europe',
    'Romania': 'Europe',
    'Russian Federation': 'Europe',
    'San Marino': 'Europe',
    'Serbia': 'Europe',
    'Slovakia': 'Europe',
    'Slovenia': 'Europe',
    'Spain': 'Europe',
    'Sweden': 'Europe',
    'Switzerland': 'Europe',
    'Ukraine': 'Europe',
    'United Kingdom': 'Europe',
    'Vatican City': 'Europe',

    # North America
    'Antigua and Barbuda': 'North America',
    'Bahamas': 'North America',
    'Barbados': 'North America',
    'Belize': 'North America',
    'Canada': 'North America',
    'Costa Rica': 'North America',
    'Cuba': 'North America',
    'Dominica': 'North America',
    'Dominican Republic': 'North America',
    'El Salvador': 'North America',
    'Grenada': 'North America',
    'Guatemala': 'North America',
    'Haiti': 'North America',
    'Honduras': 'North America',
    'Jamaica': 'North America',
    'Mexico': 'North America',
    'Nicaragua': 'North America',
    'Panama': 'North America',
    'Saint Kitts and Nevis': 'North America',
    'Saint Lucia': 'North America',
    'Saint Vincent and the Grenadines': 'North America',
    'Trinidad and Tobago': 'North America',
    'United States': 'North America',

    # South America
    'Argentina': 'South America',
    'Bolivia': 'South America',
    'Brazil': 'South America',
    'Chile': 'South America',
    'Colombia': 'South America',
    'Ecuador': 'South America',
    'Guyana': 'South America',
    'Paraguay': 'South America',
    'Peru': 'South America',
    'Suriname': 'South America',
    'Uruguay': 'South America',
    'Venezuela': 'South America',

    # Oceania
    'Australia': 'Oceania',
    'Fiji': 'Oceania',
    'Kiribati': 'Oceania',
    'Marshall Islands': 'Oceania',
    'Micronesia': 'Oceania',
    'Nauru': 'Oceania',
    'New Zealand': 'Oceania',
    'Palau': 'Oceania',
    'Papua New Guinea': 'Oceania',
    'Samoa': 'Oceania',
    'Solomon Islands': 'Oceania',
    'Tonga': 'Oceania',
    'Tuvalu': 'Oceania',
    'Vanuatu': 'Oceania',
}

country_to_continent.update({
    'United States of America': 'North America',
    'United Kingdom of Great Britain and Northern Ireland': 'Europe',
    'Iran, Islamic Republic of...': 'Asia',
    'Viet Nam': 'Asia',
    'Hong Kong (S.A.R.)': 'Asia',
    'United Republic of Tanzania': 'Africa',
    'Syrian Arab Republic': 'Asia',
    'Republic of Moldova': 'Europe',
    'Republic of Korea': 'Asia',
    'Isle of Man': 'Europe',
    'Venezuela, Bolivarian Republic of...': 'South America',
    'Congo, Republic of the...': 'Africa',
    'Nomadic': 'Other'  
})

df_ai['Continent'] = df_ai['Country'].map(country_to_continent).fillna('Other')

What expertise do data scientists have? The sections below look at it through professional experience, age, employment status, remote work, and education.

Professional Experience

We will group the years of professional coding experience (YearsCodePro) into buckets to see how experience levels are distributed among data scientists and which levels are most common.

Insight:

  • Most data scientists have between 2 and 5 years of professional coding experience.
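As an alternative sketch of the same bucketing, `pd.cut` can assign the groups once the answers are numeric. The sample answers and the stand-in numbers for the two text answers ("0" and "51") are assumptions made for this illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical YearsCodePro answers, for illustration only.
years_raw = pd.Series(
    ["Less than 1 year", "1", "3", "10", "12", "More than 50 years", None]
)

# Replace the two text answers with stand-in numbers, then convert to numeric.
years_num = pd.to_numeric(
    years_raw.replace({"Less than 1 year": "0", "More than 50 years": "51"}),
    errors="coerce",
)

# Right-closed bins: (-1, 0], (0, 1], (1, 5], (5, 10], (10, inf).
groups = pd.cut(
    years_num,
    bins=[-1, 0, 1, 5, 10, np.inf],
    labels=["0-1", "1", "2-5", "6-10", "10+"],
)
print(groups.value_counts(sort=False))
```

Because the result is an ordered categorical, `value_counts(sort=False)` lists the buckets in experience order rather than by frequency.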
In [11]:
import plotly.express as px

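def categorize_years_code(years):
    # YearsCodePro mixes numeric strings with two text answers and NaN.
    if years == "Less than 1 year":
        return "0-1"
    if years == "More than 50 years":
        return "10+"
    try:
        years = int(years)
    except (TypeError, ValueError):
        # Missing answers (NaN) stay missing and are skipped by value_counts().
        return years
    if years == 1:
        return "1"
    elif 2 <= years <= 5:
        return "2-5"
    elif 6 <= years <= 10:
        return "6-10"
    else:
        return "10+"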

df_ai.loc[:, 'YearsCodeGroup'] = df_ai['YearsCodePro'].apply(categorize_years_code)

years_code_counts = df_ai['YearsCodeGroup'].value_counts().sort_values()

fig = px.bar(years_code_counts, 
             orientation='h', 
             title='Coding Experience', 
             labels={'value': 'Number of Respondents', 'index': 'Years of Coding Experience'},
             color_discrete_sequence=['skyblue'])

fig.update_layout(xaxis_title='Number of Respondents', yaxis_title='Years of Coding Experience')

fig.show()

Age

We will create a horizontal bar chart of the age groups among data scientists to see which groups are most common in the field.

Insight:

  • Most data scientists are between 25 and 34 years old.
  • The age distribution is skewed towards younger developers, with fewer developers in the older age groups.
In [12]:
import plotly.express as px

age_distribution = df_ai['Age'].value_counts()
age_distribution = age_distribution.sort_values()

fig = px.bar(age_distribution, 
             orientation='h', 
             title='Ages of data scientists', 
             labels={'value': 'Number of Respondents', 'index': 'Age Group'},
             color_discrete_sequence=['skyblue'])

fig.update_layout(xaxis_title='Number of Respondents', yaxis_title='Age Group')

fig.show()

Employment Status

We will group the employment status of data scientists into four simple categories:

  • Full Time
  • Partial
  • Freelancer
  • Non Employed

Insight:

  • Most data scientists are employed full-time, followed by freelancers and part-time workers.
  • A small percentage of data scientists are not employed.
  • Full-time employment is the norm in this discipline.
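Employment answers can hold several selections joined with ";", which is why the categorization relies on substring checks. As a complementary sketch, the selections can also be counted one by one with `str.split` and `explode`; the sample answers below are made up for illustration.

```python
import pandas as pd

# Hypothetical Employment answers; multiple selections are joined with ";".
employment = pd.Series([
    "Employed, full-time",
    "Employed, full-time;Independent contractor, freelancer, or self-employed",
    "Student, full-time;Employed, part-time",
])

# One row per selected option, then count how often each option appears.
option_counts = employment.str.split(";").explode().value_counts()
print(option_counts)
```

This view counts options rather than respondents, so a respondent with two selections contributes to two rows.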
In [13]:
import plotly.graph_objects as go

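def categorize_employment(status):
    # Employment is a ";"-joined multi-select; missing answers count as not employed.
    if not isinstance(status, str):
        return "Non Employed"
    # Match "Employed, full-time" so that "Student, full-time" is not counted as a full-time job.
    if "Employed, full-time" in status and "freelancer" not in status and "self-employed" not in status:
        return "Full Time"
    elif "part-time" in status or "Student" in status:
        return "Partial"
    elif "freelancer" in status or "self-employed" in status:
        return "Freelancer"
    else:
        return "Non Employed"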

df_ai.loc[:, 'EmploymentStatus'] = df_ai['Employment'].apply(categorize_employment)

employment_counts = df_ai['EmploymentStatus'].value_counts()

fig = go.Figure(data=[go.Pie(labels=employment_counts.index, 
                             values=employment_counts.values, 
                             textinfo='percent+label', 
                             hole=.3)])

fig.update_layout(title_text='Employment Status')

fig.show()

Remote Work

COVID-19 changed the way people work, and many companies adopted remote work policies. We will create a pie chart of the remote work status of data scientists.

Insight:

  • The hybrid model is the most common remote work status among data scientists.
  • A significant percentage of data scientists work fully remotely.
  • A small percentage of data scientists work fully in person.
In [14]:
import plotly.graph_objects as go

remote_work_counts = df_ai['RemoteWork'].value_counts()

fig = go.Figure(data=[go.Pie(labels=remote_work_counts.index, 
                             values=remote_work_counts.values, 
                             textinfo='percent', 
                             hole=.3)])

fig.update_layout(title_text='Remote Work Status')

fig.show()

Education Level vs Coding Experience

We will create a stacked bar chart of education level broken down by experience level, to see how education and experience relate in the field.

Insight:

  • Most data scientists have a Bachelor's or Master's degree.
  • The overall mix of education levels is broadly similar across experience levels.
  • Within that, Junior developers are more likely to have a Bachelor's degree, while Senior developers are more likely to have a Master's degree.
  • For data science students, a Bachelor's or Master's degree is the usual route into a Junior developer job.
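Comparisons such as "Junior developers are more likely to have a Bachelor's degree" are easier to check with shares than with raw counts. The sketch below uses `pd.crosstab` with `normalize="columns"` on a small made-up table; the rows are hypothetical respondents, not survey data.

```python
import pandas as pd

# Hypothetical respondents, for illustration only.
toy = pd.DataFrame({
    "EdLevel": ["Bachelor's", "Bachelor's", "Master's", "Master's", "Master's", "Bachelor's"],
    "ExperienceLevel": ["Junior", "Junior", "Junior", "Senior", "Senior", "Senior"],
})

# normalize="columns" makes every experience-level column sum to 1,
# so the cells read as "share of this experience level with this degree".
shares = pd.crosstab(toy["EdLevel"], toy["ExperienceLevel"], normalize="columns")
print(shares.round(2))
```

In this toy table two thirds of the Junior group hold a Bachelor's degree and two thirds of the Senior group hold a Master's degree.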
In [15]:
import plotly.express as px
import pandas as pd

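def categorize_experience(years):
    # Text answers in YearsCodePro need explicit handling before int().
    if years in ["Less than 1 year", "0-1", "1"]:
        return "Junior"
    if years == "More than 50 years":
        return "Senior"
    try:
        years = int(years)
        if years <= 5:
            return "Junior"
        elif 6 <= years <= 10:
            return "Semi Senior"
        else:
            return "Senior"
    except (TypeError, ValueError):
        # Missing answers (NaN) land here.
        return "Other"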

df_ai.loc[:,'ExperienceLevel'] = df_ai['YearsCodePro'].apply(categorize_experience)
valid_categories = ['Junior', 'Semi Senior', 'Senior', 'Other']
df_ai.loc[:,'ExperienceLevel'] = pd.Categorical(df_ai['ExperienceLevel'], categories=valid_categories, ordered=True)

df_ai.loc[:,'EdLevelCopy'] = df_ai.loc[:,'EdLevel'].copy()

ed_level_map = {
    "Master’s degree (M.A., M.S., M.Eng., MBA, etc.)": "Master's",
    "Bachelor’s degree (B.A., B.S., B.Eng., etc.)": "Bachelor's",
    "Professional degree (JD, MD, Ph.D, Ed.D, etc.)": "Professional",
    "Some college/university study without earning a degree": "Some College",
    "Secondary school (e.g. American high school, German Realschule or Gymnasium, etc.)": "Secondary",
    "Associate degree (A.A., A.S., etc.)": "Associate",
    "Something else": "Other",
    "Primary/elementary school": "Primary"
}

df_ai.loc[:,'EdLevelCopy'] = df_ai.loc[:,'EdLevelCopy'].map(ed_level_map)

grouped_data = df_ai.groupby(['EdLevelCopy', 'ExperienceLevel'], observed=True).size().unstack(fill_value=0)

grouped_data = grouped_data.loc[grouped_data.sum(axis=1).sort_values(ascending=False).index]

fig = px.bar(grouped_data, 
             x=grouped_data.index, 
             y=grouped_data.columns,
             title='Education Level vs Coding Experience',
             labels={'value': 'Count', 'EdLevelCopy': 'Education Level'},
             barmode='stack')

fig.update_traces(texttemplate='%{value}', textposition='inside')

fig.update_layout(xaxis_title='Education Level', yaxis_title='Count', legend_title_text='Experience Level')

fig.show()

2. Salary Expectations for data scientists

Salary Distribution

We will create a box plot to visualize the distribution and skewness of salary (the CompTotal column). To get a clearer view of the typical salary range for data scientists, we first filter out the top 10% of salaries. Note that CompTotal is reported in each respondent's own currency; the survey's USD-converted figure is the ConvertedCompYearly column.

Insight:

  • The data is widely scattered and has a long tail, indicating a wide range of salaries.
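The effect of a percentile cut on a long-tailed distribution can be quantified with the sample skewness. The sketch below uses synthetic log-normal values as a stand-in for salaries (an assumption for illustration; these are not survey figures) and compares the skewness before and after dropping the top 10%.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed "salaries" (log-normal), not survey data.
rng = np.random.default_rng(0)
salaries = pd.Series(rng.lognormal(mean=11, sigma=1, size=2000))

# Keep everything at or below the 90th percentile.
cutoff = salaries.quantile(0.90)
trimmed = salaries[salaries <= cutoff]

# Positive skewness means a long right tail; trimming the top 10% reduces it.
print("skew before:", salaries.skew())
print("skew after: ", trimmed.skew())
print("rows kept:", len(trimmed), "of", len(salaries))
```

Trimming shortens the tail but does not remove the skew entirely, which is consistent with the need for a stricter cut further on.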
In [16]:
import plotly.graph_objects as go

total_compensation = df_ai['CompTotal'].dropna()

percentile_90 = total_compensation.quantile(0.90)

df_ai_filtered = df_ai[df_ai['CompTotal'] <= percentile_90]

fig = go.Figure()

fig.add_trace(go.Box(y=df_ai_filtered['CompTotal'], boxpoints='all', jitter=0.3, pointpos=-1.8))

fig.update_layout(title='Box Plot of CompTotal (Filtered Below 90th Percentile)',
                  yaxis_title='Compensation Total',
                  xaxis=dict(visible=False),
                  showlegend=False)

fig.show()

print("Original count:", len(total_compensation))
print("Filtered count:", len(df_ai_filtered))
print("90th percentile value:", percentile_90)
outliers_count = df_ai[df_ai['CompTotal'] > percentile_90].shape[0]
print("Number of values above the 90th percentile:", outliers_count)
Original count: 2186
Filtered count: 1977
90th percentile value: 1200000.0
Number of values above the 90th percentile: 209

Normal Distribution

We will create a histogram of the salary data and fit a normal distribution curve to it.

Insight:

  • The fitted normal curve does not match the histogram, so the salary data is not normally distributed.
  • The data is right-skewed, with a long tail on the right side of the distribution.
  • We will therefore trim more of the upper tail to get a better view of the salary distribution.
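Besides judging the fit by eye, a formal test can back up the conclusion. The sketch below applies `scipy.stats.normaltest` (the D'Agostino-Pearson test) to synthetic log-normal data, which stands in for the salary column here; the notebook itself does not run this test, so treat it as an optional addition.

```python
import numpy as np
from scipy import stats

# Synthetic right-skewed sample, used as a stand-in for salary data.
rng = np.random.default_rng(0)
sample = rng.lognormal(mean=11, sigma=1, size=2000)

# Maximum-likelihood estimates of the normal parameters (mean and std).
mu, std = stats.norm.fit(sample)

# Null hypothesis: the sample comes from a normal distribution.
statistic, p_value = stats.normaltest(sample)

print(f"fitted mean={mu:.0f}, std={std:.0f}")
print(f"normaltest p-value={p_value:.3g}")
```

A small p-value (for example below 0.05) is evidence against normality, which matches the mismatch between the fitted curve and the histogram.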
In [17]:
import numpy as np
import seaborn as sns
import scipy.stats as stats

filtered_compensation = df_ai_filtered['CompTotal']

plt.figure(figsize=(10, 6))
sns.histplot(filtered_compensation, kde=False, bins=30, color='blue', stat='density')

# Fit a normal distribution to the data
mu, std = stats.norm.fit(filtered_compensation)

# Plot the normal distribution curve
xmin, xmax = plt.xlim()
x = np.linspace(xmin, xmax, 100)
p = stats.norm.pdf(x, mu, std)

# Overlay the fitted density; the histogram uses stat='density', so both share the same scale
plt.plot(x, p, linewidth=2, color='red')

plt.title('Histogram of CompTotal with Fitted Normal Distribution')
plt.xlabel('Compensation Total')
plt.ylabel('Density')
plt.grid(True)
plt.show()

# Print the mean and standard deviation for reference
print(f"Mean: {mu}")
print(f"Standard Deviation: {std}")
[Figure: Histogram of CompTotal with Fitted Normal Distribution]
Mean: 171242.17197774406
Standard Deviation: 215516.05052381053

Salary by Experience Level

We create a box plot of salary by experience level to check whether the skewness is tied to a particular experience level. The plot shows that the skew is present in all three experience levels.

Insight:

  • To get a better view of the salary distribution, we will remove the top 25% of the data.
In [18]:
import plotly.express as px

fig = px.box(df_ai_filtered, 
             x='ExperienceLevel', 
             y='CompTotal', 
             category_orders={'ExperienceLevel': valid_categories},
             title='Boxplot of Compensation by Experience Level',
             labels={'ExperienceLevel': 'Experience Level', 'CompTotal': 'Compensation Total'})

fig.show()

We now remove the top 25% of the data to get a better view of the salary distribution.

After this cut, the data looks much closer to a normal distribution, with low skewness, so we can plot its distribution again.

In [19]:
import plotly.graph_objects as go

total_compensation = df_ai['CompTotal'].dropna()

percentile_75 = total_compensation.quantile(0.75)

df_ai_filtered_2 = df_ai[df_ai['CompTotal'] <= percentile_75]

fig = go.Figure()

fig.add_trace(go.Box(y=df_ai_filtered_2['CompTotal'], boxpoints='all', jitter=0.3, pointpos=-1.8))

fig.update_layout(title='Box Plot of CompTotal (Filtered Below 75th Percentile)',
                  yaxis_title='Compensation Total',
                  xaxis=dict(visible=False),
                  showlegend=False)

fig.show()

Normal Distribution

We will again plot a histogram of the salary data and fit a normal distribution curve to it with scipy.stats.

Insight:

  • The fitted normal curve now follows the histogram closely, so the trimmed salary data is approximately normally distributed.
  • We will use this trimmed data for the remaining charts.
In [20]:
import numpy as np
import seaborn as sns
import scipy.stats as stats

filtered_compensation = df_ai_filtered_2['CompTotal']

plt.figure(figsize=(10, 6))
sns.histplot(filtered_compensation, kde=False, bins=30, color='blue', stat='density')

mu, std = stats.norm.fit(filtered_compensation)

xmin, xmax = plt.xlim()
x = np.linspace(xmin, xmax, 100)
p = stats.norm.pdf(x, mu, std)

plt.plot(x, p, linewidth=2, color='red')

plt.title('Histogram of CompTotal with Fitted Normal Distribution')
plt.xlabel('Compensation Total')
plt.ylabel('Density')
plt.grid(True)
plt.show()

print(f"Mean: {mu}")
print(f"Standard Deviation: {std}")
Mean: 92786.49115314215
Standard Deviation: 54162.009649499676
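
A visual match between the histogram and the fitted curve can be complemented with a numerical check. The sketch below shows one possible way to do that, using `scipy.stats.normaltest` (the D'Agostino-Pearson test) on synthetic data. The helper name `normality_report` and the sample are assumptions and are not part of the original notebook.

```python
import numpy as np
from scipy import stats


def normality_report(values, alpha=0.05):
    """Summarize skewness, excess kurtosis and a D'Agostino-Pearson normality test."""
    values = np.asarray(values, dtype=float)
    values = values[~np.isnan(values)]
    statistic, p_value = stats.normaltest(values)
    return {
        "skewness": float(stats.skew(values)),
        "excess_kurtosis": float(stats.kurtosis(values)),
        "p_value": float(p_value),
        # True means the test does NOT reject normality at level alpha.
        "looks_normal": bool(p_value >= alpha),
    }


# Synthetic, clearly non-normal sample; NOT the survey data.
rng = np.random.default_rng(42)
skewed_sample = rng.exponential(scale=50_000, size=5_000)

report = normality_report(skewed_sample)
print(report)
```

On the notebook's data the call would be `normality_report(df_ai_filtered_2['CompTotal'])`. With large samples the test rejects normality even for small deviations, so it is best read together with the histogram.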

Salary vs Experience Level¶

We'll create a boxplot to compare salaries across experience levels.

Insight¶

  • As in the tech industry in general, salary increases with experience, and we can see that in the chart below.
  • Junior devs earn between USD 40,000 and USD 100,000.
  • Semi Senior devs earn between USD 60,000 and USD 140,000.
  • Senior devs earn between USD 70,000 and USD 160,000.
In [21]:
import plotly.express as px

fig = px.box(df_ai_filtered_2, 
             x='ExperienceLevel', 
             y='CompTotal', 
             category_orders={'ExperienceLevel': valid_categories},
             title='Boxplot of Compensation by Experience Level',
             labels={'ExperienceLevel': 'Experience Level', 'CompTotal': 'Compensation Total'})

fig.show()
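
The salary ranges listed in the insight can also be computed instead of read off the chart. This is a minimal sketch, assuming the quoted ranges correspond roughly to the box of each boxplot (first to third quartile). The helper `salary_quartiles` and the toy values are hypothetical.

```python
import pandas as pd


def salary_quartiles(df: pd.DataFrame, group_col: str, value_col: str) -> pd.DataFrame:
    """Return Q1, median and Q3 of value_col for each group in group_col."""
    return (
        df.groupby(group_col)[value_col]
        .quantile([0.25, 0.5, 0.75])
        .unstack()  # one row per group, one column per quantile
        .rename(columns={0.25: "q1", 0.5: "median", 0.75: "q3"})
    )


# Toy data for illustration only; NOT the survey data.
toy = pd.DataFrame({
    "ExperienceLevel": ["Junior"] * 5 + ["Senior"] * 5,
    "CompTotal": [40000, 50000, 60000, 70000, 80000,
                  90000, 100000, 110000, 120000, 130000],
})

summary = salary_quartiles(toy, "ExperienceLevel", "CompTotal")
print(summary)
```

On the notebook's data the call would be `salary_quartiles(df_ai_filtered_2, 'ExperienceLevel', 'CompTotal')`.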

Country¶

We'll create a choropleth chart to visualize the distribution of data scientists by country. This will help us see which countries have the highest number of data scientists.

Insight:¶

  • Most data scientists are from the United States, followed by India and Germany.
  • South America and Africa have the lowest number of data scientists.
In [22]:
import plotly.express as px

country_counts = df_ai['Country'].value_counts().reset_index()
country_counts.columns = ['Country', 'Count']

fig = px.choropleth(
    country_counts,
    locations='Country',
    locationmode='country names',
    color='Count',
    hover_name='Country',
    color_continuous_scale=px.colors.sequential.Plasma, 
    title='Country Chart to Represent the respondents'
)

fig.update_layout(
    geo=dict(
        showframe=False,
        showcoastlines=True,
        projection_type='equirectangular',  
    ),
    title={
        'text': 'Country Chart to Represent the respondents',
        'y':0.95,
        'x':0.5,
        'xanchor': 'center',
        'yanchor': 'top'
    },
    margin=dict(l=0, r=0, t=50, b=0)  # Leave room at the top for the title
)

fig.update_coloraxes(colorbar_title="People", colorbar=dict(
    len=0.75,  
    thickness=15,  
))

fig.show()

Now we are going to exclude the United States, which dominates the color scale, to get a better understanding of the distribution of data scientists across the remaining countries.

In [23]:
df_ai_exclude_us = df_ai[df_ai['Country'] != 'United States of America']
country_counts = df_ai_exclude_us['Country'].value_counts().reset_index()
country_counts.columns = ['Country', 'Count']

fig = px.choropleth(
    country_counts,
    locations='Country',
    locationmode='country names',
    color='Count',
    hover_name='Country',
    color_continuous_scale=px.colors.sequential.Plasma, 
    title='Country Chart to Represent the respondents'
)

fig.update_layout(
    geo=dict(
        showframe=False,
        showcoastlines=True,
        projection_type='equirectangular',  # Adjust projection for better global representation
    ),
    title={
        'text': 'Country Chart to Represent the respondents',
        'y':0.95,
        'x':0.5,
        'xanchor': 'center',
        'yanchor': 'top'
    },
    margin=dict(l=0, r=0, t=50, b=0)  # Adjust margins to center the map
)

fig.update_coloraxes(colorbar_title="People", colorbar=dict(
    len=0.75,  # Adjust length
    thickness=15,  # Adjust thickness
))

fig.show()
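
The ranking behind the map can also be printed as a small table. The sketch below is one way to list the most frequent countries with their share of respondents. The helper `top_countries` and the toy values are assumptions for illustration.

```python
import pandas as pd


def top_countries(countries: pd.Series, n: int = 3) -> pd.DataFrame:
    """Return the n most frequent countries with their count and share of respondents."""
    counts = countries.value_counts()  # missing values are ignored
    summary = pd.DataFrame({
        "Count": counts,
        "Share": counts / counts.sum(),
    })
    summary.index.name = "Country"
    return summary.head(n).reset_index()


# Toy data for illustration only; NOT the survey data.
toy_countries = pd.Series(
    ["United States of America"] * 5 + ["India"] * 3 + ["Germany"] * 2 + [None]
)

result = top_countries(toy_countries, n=2)
print(result)
```

On the notebook's data the call would be `top_countries(df_ai['Country'])`.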

Continent¶

Insight¶

  • Most data scientists are from Europe, North America and Asia.
  • Few people are from Oceania and Africa.
In [24]:
import plotly.graph_objects as go

continent_counts = df_ai['Continent'].value_counts()

labels = continent_counts.index
sizes = continent_counts.values

fig = go.Figure(data=[go.Pie(labels=labels, values=sizes, hole=.3)])

fig.update_traces(textinfo='percent+label')

fig.update_layout(title_text='Distribution of Entries by Continent')

fig.show()

Salary by Continent¶

Insight:¶

  • The highest salaries are in North America.
  • The lowest salaries are in Africa.
  • Asia shows a wide range of salaries.
  • Salaries in Europe fall within a narrower range, although some are above USD 150,000.
In [25]:
import plotly.express as px

fig = px.box(df_ai_filtered_2, 
             x='Continent', 
             y='CompTotal',
             title='Boxplot of Compensation Salary by Continent',
             labels={'CompTotal': 'Compensation Salary', 'Continent': 'Continent'})

fig.show()

Salary by Remote Work and Continent¶

Insight:¶

  • The highest salaries are in North America.
  • In South America, remote work tends to pay better than working in the office.
In [26]:
import plotly.express as px

fig = px.box(df_ai_filtered_2, 
             x='RemoteWork', 
             y='CompTotal', 
             color='Continent',
             title='Boxplot of Compensation Total by Remote Work Category and Continent',
             labels={'RemoteWork': 'Remote Work Category', 'CompTotal': 'Compensation Total'},
             category_orders={'Continent': df_ai_filtered_2['Continent'].unique()})

fig.update_layout(legend_title_text='Continent')

fig.show()
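
The comparison between remote and in-office salaries can be made explicit with a pivot table of medians. This is a minimal sketch on toy data. The category labels and values are assumptions and may differ from the labels used in the survey's `RemoteWork` column.

```python
import pandas as pd

# Toy data for illustration only; NOT the survey data.
toy = pd.DataFrame({
    "Continent": ["South America"] * 4 + ["Europe"] * 4,
    "RemoteWork": ["Remote", "Remote", "In-person", "In-person"] * 2,
    "CompTotal": [60000, 80000, 30000, 50000, 70000, 90000, 80000, 100000],
})

# Median compensation for every (continent, remote work category) pair.
median_by_group = toy.pivot_table(
    index="Continent", columns="RemoteWork", values="CompTotal", aggfunc="median"
)

# Positive values mean that remote roles have the higher median.
median_by_group["RemoteMinusInPerson"] = (
    median_by_group["Remote"] - median_by_group["In-person"]
)
print(median_by_group)
```

Medians are used here because they are less sensitive to the remaining extreme values than means.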

3. Most popular languages, databases, and platforms among data scientists¶

We are going to display treemaps to visualize the most popular languages, databases, and platforms among data scientists.

Insights:¶

  • Language: The most popular language among data scientists is Python.
  • Database: The most popular databases among data scientists are PostgreSQL and MySQL; relational databases are the most popular.
  • Cloud Platform: The most popular cloud platform among data scientists is AWS, followed by Azure and GCP. Databricks is gaining momentum.
  • Web Framework: FastAPI is the most popular web framework among data scientists.
  • Miscellaneous Tech: Pandas and NumPy are the most relevant in this field.
  • Tools: Docker and Pip are the most popular tools.
  • Collaboration Tools: VS Code and Jupyter are the most popular collaboration tools.
  • Office Stack Async: Jira is the most popular tool.
  • Office Stack Sync: Slack and Microsoft Teams are the most popular tools.
  • AI usage: ChatGPT, GitHub Copilot and Gemini are the most popular tools.
In [27]:
def plot_treemap_from_column(df_column):
    """Draw a treemap of how often each ';'-separated value appears in df_column."""
    # Each answer is a ';'-separated list: split it and give every item its own row
    tech_series = df_column.dropna().str.split(';').explode()

    # Count the mentions of each technology
    tech_counts = tech_series.value_counts().reset_index()
    tech_counts.columns = ['Technology', 'Count']

    fig = px.treemap(tech_counts, path=['Technology'], values='Count', 
                     color='Count', color_continuous_scale='Viridis')

    fig.update_layout(margin=dict(t=50, l=25, r=25, b=25))
    fig.show()
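
The counting step inside the function above can be seen in isolation on a tiny example. This is a sketch with toy answers; the helper name `count_technologies` is hypothetical. It shows how semicolon-separated responses are split so that every technology is counted once per mention.

```python
import pandas as pd


def count_technologies(column: pd.Series) -> pd.DataFrame:
    """Split ';'-separated answers and count how often each technology is mentioned."""
    exploded = column.dropna().str.split(";").explode()
    counts = exploded.value_counts()
    return pd.DataFrame({"Technology": counts.index, "Count": counts.values})


# Toy answers for illustration only; NOT the survey data.
toy_answers = pd.Series(["Python;SQL", "Python;R", None, "Python;SQL;Bash"])

tech_counts = count_technologies(toy_answers)
print(tech_counts)
```

Because one respondent can list several technologies, the counts add up to more than the number of respondents.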
In [28]:
plot_treemap_from_column(df_ai['LanguageHaveWorkedWith'])
In [29]:
plot_treemap_from_column(df_ai['LanguageWantToWorkWith'])
In [30]:
plot_treemap_from_column(df_ai['DatabaseHaveWorkedWith'])
In [31]:
plot_treemap_from_column(df_ai['DatabaseWantToWorkWith'])
In [32]:
plot_treemap_from_column(df_ai['PlatformHaveWorkedWith'])
In [33]:
plot_treemap_from_column(df_ai['PlatformWantToWorkWith'])
In [34]:
plot_treemap_from_column(df_ai['WebframeHaveWorkedWith'])
In [35]:
plot_treemap_from_column(df_ai['WebframeWantToWorkWith'])
In [36]:
plot_treemap_from_column(df_ai['MiscTechHaveWorkedWith'])
In [37]:
plot_treemap_from_column(df_ai['MiscTechWantToWorkWith'])
In [38]:
plot_treemap_from_column(df_ai['ToolsTechHaveWorkedWith'])
In [39]:
plot_treemap_from_column(df_ai['ToolsTechWantToWorkWith'])
In [40]:
plot_treemap_from_column(df_ai['NEWCollabToolsHaveWorkedWith'])
In [41]:
plot_treemap_from_column(df_ai['NEWCollabToolsWantToWorkWith'])
In [42]:
plot_treemap_from_column(df_ai['OfficeStackAsyncHaveWorkedWith'])
In [43]:
plot_treemap_from_column(df_ai['OfficeStackAsyncWantToWorkWith'])
In [44]:
plot_treemap_from_column(df_ai['OfficeStackSyncHaveWorkedWith'])
In [45]:
plot_treemap_from_column(df_ai['OfficeStackSyncWantToWorkWith'])
In [46]:
plot_treemap_from_column(df_ai['AISearchDevHaveWorkedWith'])
In [47]:
plot_treemap_from_column(df_ai['AISearchDevWantToWorkWith'])